It takes hierarchy to design multimillion-gate chips on manageable and predictable schedules. Ensuring that timing will converge to a chosen goal requires early timing budgets, abstraction of simplified block-routing and timing models, and proper margins along several axes. This article details the approach Morphics Technology uses to build 5 million- to 15 million-gate chips for wireless signal processing that meet schedule, functionality and timing goals on first silicon.
The challenge for on-schedule physical implementation of multimillion-gate chips starts with early floor planning and partitioning, and continues throughout the design flow with appropriate abstraction and approximations to get the most benefit out of all work expended. To achieve timing closure, each stage of the process must include sufficient margin, and the project must focus on moving ahead to avoid spending too much time on premature optimizations.
As a design closes in on tapeout, several issues must converge simultaneously, and a useful concept is to relax the added margins incrementally toward the final targets.
A truly hierarchical flow supports replicated instances of blocks that share a single abstraction for each of their logic, timing, routing and port-location models. In particular, the physical routing, parasitic extraction and static timing-analysis steps need to be separated so that top-level runs use only abstractions of instantiated blocks without seeing the full transistor, gate or polygon databases within each of the blocks.
Partitioning of a design serves to break it into manageable pieces that can benefit from the parallelized effort of the individuals in a team. The goal is to allow separable progress of the work both for individual blocks and concurrently at the top level. A good goal is to seek “equalized pain” between blocks and their parents in a hierarchical design, making the block size small enough so that the effort of routing and timing closure at a block level is about the same as the effort required for the parent.
A good metric for the use of hierarchy is the “hierarchical reuse factor,” which is the ratio of the number of block instances to the number of block types. Another good principle in choosing the granularity of partitioning is to ensure that no individual run takes more than 20 hours. Partitioning of blocks so that individual runs take a day or less allows valuable iterations to proceed with reasonable cycle times of a few days per turn, including designer time to analyze results.
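As a small illustration of the reuse metric described above, the sketch below computes the ratio of block instances to block types for a floor plan. The function name, the block names and the instance list are hypothetical and chosen only for the example.

```python
from collections import Counter


def hierarchical_reuse_factor(instances):
    """Return (number of block instances) / (number of block types).

    `instances` is a list of (instance_name, block_type) pairs.
    """
    if not instances:
        raise ValueError("need at least one block instance")
    types = Counter(block_type for _, block_type in instances)
    return len(instances) / len(types)


# Hypothetical floor plan: seven instances built from three block types.
floorplan = [
    ("dsp0", "dsp_core"), ("dsp1", "dsp_core"),
    ("dsp2", "dsp_core"), ("dsp3", "dsp_core"),
    ("mem0", "sram_ctl"), ("mem1", "sram_ctl"),
    ("io0", "io_ring"),
]

reuse = hierarchical_reuse_factor(floorplan)
print(f"hierarchical reuse factor: {reuse:.2f}")
```

A factor of 1.0 means every block is unique; larger values mean that each block type's routing and timing closure effort is amortized over more instances.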
Given today's tools, we have found that a good rule of thumb is to seek blocks that have about 150,000 placeable instances, or around 400,000 gates. Even though tools could support blocks several times this size, individual blocks with 1 million gates just take too long in run-time for all steps and are too close to failing completely due to lack of real or virtual memory, even on machines with many gigabytes.
Since early judgment is important, a powerful concept is to use a linear “signal velocity” metric that allows top-level timing to be estimated before the actual placement of repeaters.
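One way to picture a linear signal-velocity estimate is the sketch below, which assumes that the delay of a repeated top-level wire grows in proportion to its Manhattan length. The velocity value, the port coordinates and the function names are placeholders for illustration, not figures from this article.

```python
def manhattan_mm(src, dst):
    """Manhattan distance in millimeters between two (x, y) port locations."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def estimated_wire_delay_ns(src, dst, velocity_mm_per_ns):
    """Linear estimate: delay = route length / signal velocity."""
    if velocity_mm_per_ns <= 0:
        raise ValueError("signal velocity must be positive")
    return manhattan_mm(src, dst) / velocity_mm_per_ns


# Hypothetical port locations (mm) and an assumed velocity of 6 mm/ns.
delay = estimated_wire_delay_ns((1.0, 2.0), (7.0, 5.0), velocity_mm_per_ns=6.0)
print(f"estimated top-level wire delay: {delay:.2f} ns")
```

Because the estimate needs only port locations from the floor plan, it can be compared against the top-level timing budget long before repeaters are placed.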
Even after doing the hard work of partitioning and floor planning, one of the classic traps some design approaches fall into is to then choose analysis methods that don't preserve the isolation between parent and child in the hierarchy.
In modern 0.18-micron or smaller technology, minimum-pitched wires are taller than they are wide, and this means that cross-coupling to neighboring signals can often be in excess of 50 percent of a wire's total capacitance. The effect of simultaneous switching cannot be ignored, but it is also unrealistic to seek a precise determination of when every coupling combination can occur over the range of process spread. Therefore, safe and successful timing convergence requires conservative choices that bound delay calculation by minimum and maximum values, rather than hopelessly seeking to find a single “exact” value.
A good approach is to map cross-coupling capacitance into bounded “effective” capacitance. Fig. 1 shows the possibilities of aggressors switching in either the opposite or same direction as the victim signal under analysis. While it is possible for a fast opposite-direction aggressor to have an effective capacitance of three or more times the actual nominal cross-coupling, it is a reasonable approximation to set the effective cross-coupling capacitance to twice the nominal capacitance. That tactic is still much more conservative than just neglecting the capacitance-multiplying effect of the switching.
Likewise, although it is possible for a fast same-direction aggressor to so help the transition of a slow victim that the effective coupling capacitance should actually be negative, it is reasonable just to set the minimum effective capacitance value to zero. Note that when complete complementary timing checks use both maximum and minimum capacitances, making the minimum capacitance smaller is the more conservative choice, because it shortens the minimum delays used in hold checks.
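The bounding described above can be sketched as a per-net calculation that multiplies the nominal coupling capacitance by 2 for the maximum case and by 0 for the minimum case. The function name and the capacitance values are assumptions made for the example.

```python
def effective_cap_bounds(c_ground_ff, c_coupling_ff, k_max=2.0, k_min=0.0):
    """Return (c_min, c_max) effective net capacitance in femtofarads.

    c_ground_ff   -- capacitance to non-switching conductors
    c_coupling_ff -- nominal cross-coupling capacitance to neighboring signals
    k_max, k_min  -- multipliers for opposite- and same-direction aggressors
    """
    c_min = c_ground_ff + k_min * c_coupling_ff
    c_max = c_ground_ff + k_max * c_coupling_ff
    return c_min, c_max


# Hypothetical net: 100 fF nominal total, 60 percent of it cross-coupling.
c_min, c_max = effective_cap_bounds(c_ground_ff=40.0, c_coupling_ff=60.0)
print(f"min {c_min:.0f} fF, max {c_max:.0f} fF")
```

For this example net the bounds span a factor of four, which is why setup checks must use the maximum value and hold checks the minimum value rather than a single nominal number.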
Even after making conservative choices for the handling of cross-coupling, it is still important to add extra margin to account for the effects on timing of many other factors such as process spread, variations in dielectric thickness or permittivity, on-chip process tilt, on-chip variation in power-supply voltage drops and inaccuracies in extraction and transistor characterizations. The minimum and maximum resistances and capacitances for every net allow us to calculate minimum and maximum delays for every net, which can be back-annotated using .sdf files into timing analyses.
In a register (edge-triggered-flop) based design, think of every setup-and-hold check as determining the results of a race ending at the receiving register. Both sides should be viewed as complementary duals of each other, where, as shown in Fig. 2, every setup check is based upon using the maximum delays through the launching register and combinational logic up to the receiving register, and the exact same paths with minimum delays are used for a hold check.
Furthermore, the paths used for these checks need to include the clock trees going back to the point of reconvergence between the parts of the clock-distribution tree feeding the launching and receiving register. Thus, ordinary setup-and-hold checks also validate the quality of the clock-distribution tree, emphasizing local skew instead of global skew. They allow for the possibilities of utilizing “useful skew,” as well as not penalizing bad skew where it wouldn't make any difference to the affected setup or hold checks anyway.
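The race view of setup-and-hold checks can be sketched as follows. Clock delays are measured from the point of reconvergence, so any shared part of the clock tree drops out. The `Race` class, the delay values and the optional multiplicative `margin` are illustrative assumptions and are not a reproduction of this article's own equations.

```python
from dataclasses import dataclass


@dataclass
class Race:
    launch_clk: tuple   # (min, max) ns, reconvergence point to launching register
    capture_clk: tuple  # (min, max) ns, reconvergence point to receiving register
    data: tuple         # (min, max) ns, clock-to-output plus combinational logic
    t_setup: float      # setup time of the receiving register, ns
    t_hold: float       # hold time of the receiving register, ns


def setup_slack(r, period, margin=1.0):
    """Latest data arrival (max delays) against earliest capture clock (min)."""
    late_data = margin * (r.launch_clk[1] + r.data[1])
    early_capture = r.capture_clk[0] / margin
    return period + early_capture - r.t_setup - late_data


def hold_slack(r, margin=1.0):
    """Earliest data arrival (min delays) against latest capture clock (max)."""
    early_data = (r.launch_clk[0] + r.data[0]) / margin
    late_capture = margin * r.capture_clk[1]
    return early_data - late_capture - r.t_hold


# Hypothetical path with an 8 ns clock period.
race = Race(launch_clk=(0.8, 1.0), capture_clk=(0.9, 1.2),
            data=(0.6, 6.0), t_setup=0.2, t_hold=0.1)
print(f"setup slack: {setup_slack(race, period=8.0):.2f} ns")
print(f"hold slack:  {hold_slack(race):.2f} ns")
print(f"hold slack with 10% margin: {hold_slack(race, margin=1.1):.2f} ns")
```

In this example the hold check passes with no added margin but fails once a 10 percent margin is applied, which is the kind of path that would receive a delay element.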
This rigorous and safe approach to clock tree analysis allows a much lower-power clock-distribution scheme using tall clock trees with limited average fan-out, rather than the overkill of more common “short, fat” solutions that sometimes expend a third of the switching capacitance of the chip just in distributing a core clock.
Where hold checks do show violations with respect to a chosen added amount of margin, delay elements can be inserted. Sometimes, because of the conservative handling of minimum and maximum delay calculations, it can be a tricky balancing act to add delay elements in just the right way to fix the hold violations without degrading the setup paths too much. A useful approach is to script the search: for each receiving-register input that has a hold violation (using minimum-delay calculations), walk the nets in its fan-in cone, find the upstream point with the maximum setup slack (under the maximum-delay calculations), and insert the delay buffer into that net.
As shown in Fig. 3, often this means the hold-fix delay element must be inserted at a point adjacent to neither the launching register's output nor the receiving register's input, which is where simplistic hold-fix algorithms would usually place it.
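A minimal sketch of such a hold-fix search is given below. It assumes the netlist is available as a mapping from each net to the nets that feed it, and that a per-net worst setup slack has already been computed under maximum delays. The net names, slack values and function names are hypothetical.

```python
def fanin_cone(endpoint_net, upstream):
    """Collect every net that can reach `endpoint_net`, including itself."""
    seen, stack = set(), [endpoint_net]
    while stack:
        net = stack.pop()
        if net in seen:
            continue
        seen.add(net)
        stack.extend(upstream.get(net, ()))
    return seen


def pick_hold_fix_net(endpoint_net, upstream, setup_slack_ns):
    """Choose the net in the fan-in cone with the most setup slack to spare."""
    cone = fanin_cone(endpoint_net, upstream)
    # Sorting first makes ties resolve deterministically by net name.
    return max(sorted(cone), key=lambda net: setup_slack_ns[net])


# Hypothetical cone: q_a -> n1 -> n2 -> d_b, with q_c also feeding n2.
upstream = {"d_b": ["n2"], "n2": ["n1", "q_c"], "n1": ["q_a"]}
setup_slack_ns = {"d_b": 0.3, "n2": 0.3, "n1": 1.2, "q_a": 0.4, "q_c": 0.3}

chosen = pick_hold_fix_net("d_b", upstream, setup_slack_ns)
print(f"insert hold-fix delay on net: {chosen}")
```

Here the chosen net sits in the middle of the path, adjacent to neither register. A production script would presumably also confirm that the chosen net lies on the violating minimum-delay path and would rerun both checks after each insertion.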
In the later stages of timing closure, the actual results from block-level static timing-analysis runs can be used to create models of the blocks for use as objects in a top-level static timing-analysis run. Of course, replicated instances will need only a single timing model for the block type, consistent with the choice that timing analysis of the blocks is not complicated by the instance-specific wires of overlying routing in the parent.
Layered goals
In the early stages of physical implementation, it is best to set high goals for both internal-block timing and top-level timing (as judged by linearized signal velocity, discussed earlier). As a design progresses toward tapeout, and the top-level timing gets replaced with actual timing models derived from routed and timed blocks, the goals can be relaxed toward the eventual tapeout requirement target.
The objective is to ensure convergence by seeking to “touch” fewer and fewer nets and objects with each fix/reroute/retime iteration. A good goal is to see that the number of touched nets decreases by a factor of four to eight for each iteration. Even slight reductions in the goals sought at each iteration aid greatly in convergence. Fig. 4 shows this principle, where the y axis is measured as factors with respect to the eventual target for each quantity:
Criteria          Quantity
Setup checks      Clock frequency
Hold checks       Skew margin
Antenna checks    Allowable charge ratio
In effect, convergence proceeds along these three axes simultaneously. Even though earlier passes do somewhat more “work” by seeking to fix issues based on a stricter criterion, this approach lessens the number of items that have to be re-worked when their neighboring wires or objects get bumped. For example, violations in the allowable antenna charge ratio (a rule aiding yield by limiting the ratio between the area of metal wires and the polysilicon gate area they connect to) are easily fixed. But these fixes touch routing, and can disrupt tight setup or hold paths. So, convergence is improved by simultaneously and incrementally lowering the bar on all goals toward the required targets.
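The layered-goal idea can be sketched as a schedule of factors, each expressed relative to the eventual target, that moves toward 1.0 as the iterations proceed. The starting factors, the number of iterations and the choice of a linear schedule are assumptions for illustration and are not values from Fig. 4.

```python
def relaxation_schedule(start_factor, iterations):
    """Linear schedule of goal factors from `start_factor` to the target, 1.0."""
    if iterations < 2:
        raise ValueError("need at least two iterations")
    step = (1.0 - start_factor) / (iterations - 1)
    return [start_factor + i * step for i in range(iterations)]


# Hypothetical starting points. Frequency and skew margin start above the
# target; the allowable antenna charge ratio is a limit, so it starts below.
goals = {
    "clock_frequency": relaxation_schedule(1.20, 5),
    "skew_margin": relaxation_schedule(1.50, 5),
    "antenna_charge_ratio": relaxation_schedule(0.80, 5),
}

for name, factors in goals.items():
    print(name, [round(f, 3) for f in factors])
```

Every schedule ends at a factor of 1.0, so the final iteration checks against the actual tapeout requirement while earlier iterations check against stricter criteria.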
Successful closure
Adding sufficient margins early allows a design to converge with only a handful of iterations for each block type and for the top level. By adopting a conservative design style with full complementary checking of both sides of every setup-and-hold race, success is built into the process. Otherwise, a design team risks being bitten by unaccounted-for coupling or noise issues found only after the silicon returns.
The conservative delay-calculation metrics discussed also mean that an upside can be expected. Because the delay calculations are based on worst-case values, a typical process spread, as shown in Fig. 5, will actually produce most parts that run well faster than the worst-case model predicts.
Moreover, the extra multiplicative timing margins built in by the factor x in equations 1 and 2 are an additional upside factor between the “guaranteed” goal stated at tapeout and the actual attained clock frequencies proven by testing of the finished packaged parts across environmental conditions.